The original “2016 New Coder Survey” dataset consists of 113 variables. Most of these variables are answers to survey questions, though a few are computer-generated (e.g. respondent ID and survey start/end times). Over 15,000 observations (i.e. respondents) exist.
The str function output is long and messy, so I won’t print it here. Please consult Free Code Camp’s list of survey questions and possible answers. Most variables are boolean, numeric, or categorical.
I created six new variables from existing variables, using ifelse statements and the cut function on HoursLearning. These new variables bring our total to 119 variables.
## [1] 15620 119
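As a rough illustration of the ifelse approach, here is a minimal sketch on a toy data frame. The values and the IsFemale flag are hypothetical and are not one of the six variables actually created; only Gender and HoursLearning are names that appear in this report.

```r
# Toy stand-in for the survey data; all values are made up for illustration.
survey <- data.frame(
  Gender = c("male", "female", NA, "female"),
  HoursLearning = c(5, 12, 30, NA)
)

# Hypothetical derived flag. ifelse() is vectorised and keeps NA as NA.
survey$IsFemale <- ifelse(survey$Gender == "female", 1, 0)

# dim() reports the number of rows and columns.
dim(survey)
```

Each derived column adds one to the column count that dim() reports.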
646 respondents answered “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”
## [1] 646 119
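The subset can be sketched as below. The column name JobRoleInterest and the toy rows are assumptions made for illustration; only the answer text comes from the survey question quoted above.

```r
# Toy stand-in for the survey data; JobRoleInterest is an assumed column name.
survey <- data.frame(
  JobRoleInterest = c("Data Scientist/Data Engineer",
                      "Full-Stack Web Developer",
                      NA,
                      "Data Scientist/Data Engineer"),
  Age = c(26, 31, 22, 40)
)

# %in% returns FALSE for NA, so unanswered rows land in the "others" subset.
is_data <- survey$JobRoleInterest %in% "Data Scientist/Data Engineer"

data_sci <- survey[is_data, ]
others   <- survey[!is_data, ]

dim(data_sci)
dim(others)
```

Using %in% rather than == keeps respondents who skipped the question out of the data-focused subset without dropping them from the comparison group.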
Additional comments are included where the results significantly differ from the full new coder survey dataset.
The univariate section mimics the structure of Free Code Camp’s Medium article for direct comparison of data science/engineering students and new coders in general. A few additional univariate plots are included to smooth the transition to the plots explored in the bivariate and multivariate sections.
CodeNewbie and Free Code Camp designed the survey, and dozens of coding-related organizations publicized it to their members.
Of the 646 developing data scientists and data engineers who responded to the survey:
## female
## 0.2447917
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.00 22.00 26.00 27.72 31.25 65.00 74
This average is 5 months longer than the full survey dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.00 8.00 16.17 20.00 360.00 31
Log-transforming the long-tailed data to better understand the distribution shows that programming experience peaks at around one year.
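A minimal sketch of this kind of transformation, using made-up months of experience. The +1 offset is my assumption for handling the zero values seen in the summary, not necessarily what the original plot used.

```r
# Made-up months of programming experience with a long right tail.
months <- c(0, 1, 3, 6, 12, 12, 24, 60, 300)

# log10(x + 1) keeps zero-month respondents finite (assumed offset).
log_months <- log10(months + 1)

# Histogram on the log scale; twelve months sits near log10(13), about 1.1.
hist(log_months,
     main = "Programming experience (log10 scale)",
     xlab = "log10(months + 1)")
```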
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.00 10.00 14.41 20.00 80.00 30
Compared to 40% for the full new coder survey, this is a bit shocking, but understandable given the demand for data scientists and engineers in industry.
The data-related subset has a longer time horizon than the full survey dataset, where 65% are applying within the next year.
The developing data scientists/engineers use Coursera, edX, and Udacity more frequently than new coders in general. These companies have wider subject area scopes than some of the programming-specific resources listed.
6% of new coders from the full survey dataset have attended a bootcamp.
Compared to 58% for the full new coder survey, the data-focused subset is more skewed towards graduate studies.
Diversity amongst majors is greater than in the full survey, where Computer Science and Information Technology checked in at #1 and #2 with 17% and 5%, respectively.
Two thirds of new coders in general are currently working.
Employment fields are more spread out than in the full new coder survey, where 50% of respondents work in software development and IT.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 25000 43600 48420 60000 200000 390
With data science/engineering being notoriously lucrative in 2016, some respondents are likely seeking higher wages.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 40000 60000 61110 80000 200000 65
## has served in military
## 0.06501548
## has children
## 0.1346749
## financially supporting
## 0.03250774
## no spouse
## 0.2137405
## is underemployed
## 0.4705882
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 76000 150000 194400 240000 1000000 591
This average is $3k more than the full survey dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 10000 20000 36880 45000 1000000 485
With the million-dollar outlier removed, the distribution is much clearer, with the majority of debt under $75k. I hope that outlier is a joke.
## has high speed internet
## 0.8573913
## is receiving disability benefits
## 0.02608696
There isn’t really a singular main feature of interest in the “2016 New Coder Survey” dataset. There are several smaller features, but nothing stands out like diamond price and its relationship to carat weight, cut, colour, etc. in the R diamonds dataset, for example. The diamonds dataset covers two time periods - the existence of the diamond pre-sale and post-sale, whereas the survey dataset only covers a single period - the early stages of an individual’s coding career.
If we could fast-forward several years and survey the same respondents, the main feature of interest might be career earnings (adjusted for cost of living, preferably) and/or self-reported career satisfaction. A predictive model using a combination of variables from the 2016 survey could then be built to estimate career success.
If the survey asked “Are you already working as a data scientist/engineer?” instead of “Are you already working as a software developer?”, that variable might also be a main feature of interest. Unfortunately, the answer to that question cannot be extracted from the existing variables.
Though there isn’t a main feature of interest, we can separate the respondents who did not answer “Data Scientist/Data Engineer” to the job role interest question (as we already have for those who did) and compare the two subsets using bivariate and multivariate plots.
I will also explore two smaller features, how many hours dedicated to learning per week and expected salary, using bivariate and multivariate plots.
There was a lot of long-tailed data. Most did not require transformation to view the details of the distribution. Programming experience was strongly positively skewed, however, and required log transformation to visually compare those with 3 months of experience to those with 25 years.
The following operations were performed to tidy, adjust, or change the form of the data:
I used gather() to transform the data from a wide format to a long format. Then I transformed the long data into factor format, using the replicate function with the number of yeses as the multiplier. This data is used to create each category’s bar chart. The first five operations were performed so bar charts could be created, which wasn’t possible with the original data format. The “Americas” separation was performed for additional insight.
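The wide-to-long step can be sketched as follows. The resource counts are made up, the column names Resource and Yeses are hypothetical, and rep() is my reading of “the replicate function”; the actual code may differ.

```r
library(tidyr)

# Made-up yes counts, one column per resource (wide format).
wide <- data.frame(Coursera = 3, edX = 2, Udacity = 1)

# Wide to long: one row per resource with its yes count.
long <- gather(wide, key = "Resource", value = "Yeses")

# Repeat each resource name once per yes, then store the result as a factor.
resource <- factor(rep(long$Resource, times = long$Yeses))

# One entry per yes, which is the shape a bar chart of counts expects.
table(resource)
```

A factor with one entry per yes can be passed straight to a bar chart geom or to barplot(table(resource)).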
14974 respondents did not answer “Data Scientist/Data Engineer” to the question: “Which one of these roles are you most interested in?”
## [1] 14974 119
The next two plots are created using pairs.panels() from the psych package. They display a scatter plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.
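For readers who want to reproduce this kind of plot, here is a hedged sketch on simulated data. The three columns and their relationships are invented for illustration, and the pairs.panels() call only runs if psych is installed.

```r
set.seed(1)
n <- 50

# Simulated stand-ins for three numeric survey variables (names are assumptions).
toy <- data.frame(Age = round(runif(n, 18, 60)))
toy$Income <- 1000 * toy$Age + rnorm(n, sd = 15000)
toy$ExpectedEarning <- 0.5 * toy$Income + rnorm(n, mean = 30000, sd = 15000)

# Pearson correlation matrix, using all complete pairs of observations.
cors <- cor(toy, method = "pearson", use = "pairwise.complete.obs")
round(cors, 2)

# The SPLOM itself: scatter plots, histograms, and correlations in one grid.
if (requireNamespace("psych", quietly = TRUE)) {
  psych::pairs.panels(toy, method = "pearson")
}
```

The numbers printed by cor() are the same values that pairs.panels() shows above the diagonal.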
For the data science subset of the survey, all correlations are below 0.4, which supports my statement that no main feature exists. The strongest of the correlations are:
The phenomena revealed are intuitive, but not groundbreaking: you tend to make more money when you are older, you tend to expect your next job to have a high salary if your current one does, and expensive schooling tends to lead to higher income levels.
For the non-data science subset of the survey, all correlations are again below 0.4. Most of the correlations are within 0.1 of the data science subset, except for three:
Interesting. I bet the skew towards graduate studies for the data science subset plays a role here, where higher levels of student debt and higher salaries are expected.
Let’s return to the data science subset of the survey. One of the strongest correlations is between age and current salary.
The earnings vs. age trend isn’t maintained as these individuals prepare to transition to the data science/engineering field. Younger individuals appear willing to capitalize on lucrative data-related salaries and older individuals appear willing to take a pay cut to enter their new field of choice.
The variables on the x-axis in the boxplots below are in descending order in terms of number of respondents.
Since only two agender, three genderqueer, and two trans respondents exist and males represent 75% of the subset, we can’t say much about which gender is most dedicated to learning. The medians for males and females are identical (10 hours per week).
## Gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.00 10.00 15.09 20.00 80.00 12
## --------------------------------------------------------
## Gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.00 10.00 12.15 15.00 80.00 11
## --------------------------------------------------------
## Gender: genderqueer
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 13.50 20.00 25.67 35.00 50.00
## --------------------------------------------------------
## Gender: agender
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 4.5 7.0 7.0 9.5 12.0
## --------------------------------------------------------
## Gender: trans
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 9 16 16 23 30
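Grouped summaries in this layout are what base R’s by() prints. A minimal sketch with made-up values follows; the toy data frame is an assumption, and the real call may have been written differently.

```r
# Made-up hours per week for a handful of respondents.
toy <- data.frame(
  Gender = c("male", "male", "female", "female", "female"),
  HoursLearning = c(6, 20, 5, 10, NA)
)

# Naming the grouping list gives the "Gender: ..." headers when printed.
hours_by_gender <- by(toy$HoursLearning, list(Gender = toy$Gender), summary)
hours_by_gender
```

summary() only adds an NA's column for groups that contain missing values, which is why some groups above show it and others do not.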
The medians for five of the six continents are identical (10 hours per week). The bulk of Asian students appear most dedicated to learning, with their 75th percentile approaching 25 hours per week. Africa may be suffering from a small sample size issue with only 11 respondents.
## ContinentCitizen: North America
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.00 10.00 14.39 20.00 80.00 12
## --------------------------------------------------------
## ContinentCitizen: Europe
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 6.00 10.00 14.41 20.00 50.00 4
## --------------------------------------------------------
## ContinentCitizen: Asia
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 5.00 10.00 15.47 23.75 56.00 4
## --------------------------------------------------------
## ContinentCitizen: Oceania
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 5.0 10.0 11.6 15.0 42.0
## --------------------------------------------------------
## ContinentCitizen: South America
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.00 7.75 10.00 15.06 16.25 40.00 1
## --------------------------------------------------------
## ContinentCitizen: Africa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.500 6.000 8.727 15.000 21.000
Again, male and female medians are identical. They both expect around a $60k data science/engineering salary. There is a gap in first quartiles, however, as females expect $10k more than males.
## Gender: male
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 40000 60000 60770 80000 200000 45
## --------------------------------------------------------
## Gender: female
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 50000 60000 61560 80000 150000 14
## --------------------------------------------------------
## Gender: genderqueer
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48000 54000 60000 59330 65000 70000
## --------------------------------------------------------
## Gender: agender
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30000 37500 45000 45000 52500 60000
## --------------------------------------------------------
## Gender: trans
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65000 66250 67500 67500 68750 70000
Whoa. Expected earnings vary far more by continent than in the three boxplots above. Most North Americans expect the highest range of salaries, with their interquartile range spanning from $55k to $80k. The 75th percentile for Europe is $5k below North America’s 25th percentile. I wonder if some European respondents forgot to convert from pounds or euros to USD. Expectations in Asia are all over the map.
A lot of these individuals are using similar, if not the same, online educational resources. Labour market economics can be cruel.
## ContinentCitizen: North America
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12000 55000 65000 68420 80000 200000 22
## --------------------------------------------------------
## ContinentCitizen: Europe
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 24000 40000 41960 50000 120000 23
## --------------------------------------------------------
## ContinentCitizen: Asia
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 20000 50000 55470 86300 150000 7
## --------------------------------------------------------
## ContinentCitizen: Oceania
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10000 40000 65000 65290 75000 160000 3
## --------------------------------------------------------
## ContinentCitizen: South America
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12000 33000 45000 48730 55000 100000 2
## --------------------------------------------------------
## ContinentCitizen: Africa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 40000 62500 77500 85300 87500 200000 1
Salary expectations don’t vary much depending on hours dedicated to learning. Other than those who dedicate 40+ hours per week, an expected salary in the $40k to $80k range is standard.
##
## (0,10] (10,20] (20,40] (40,80]
## 351 136 101 19
## HoursLearningBucket: (0,10]
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 40000 60000 61700 80000 200000 28
## --------------------------------------------------------
## HoursLearningBucket: (10,20]
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 40000 60000 58380 75000 120000 10
## --------------------------------------------------------
## HoursLearningBucket: (20,40]
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 40000 60000 58130 75000 200000 8
## --------------------------------------------------------
## HoursLearningBucket: (40,80]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20000 50000 60000 66740 85000 135000
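The bucket labels above match what cut() produces with breaks at 0, 10, 20, 40, and 80. Here is a small sketch on made-up hours; the breaks are inferred from the labels rather than taken from the original code.

```r
# Made-up weekly hours, including a zero and a missing value.
hours <- c(0, 3, 10, 15, 20, 35, 40, 60, 80, NA)

# Default intervals are open on the left and closed on the right: (0,10], ...
bucket <- cut(hours, breaks = c(0, 10, 20, 40, 80))

# table() drops NA by default, so zeros and missing values are not counted.
table(bucket)
```

Because the first interval is open at 0, a respondent reporting zero hours falls outside every bucket and becomes NA. That would be consistent with the bucket counts summing to 607 while 616 respondents reported their hours.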
The data science/engineering subset of the survey is largely similar to the non-data science/engineering subset, except for three correlations involving student debt owed. I bet this has something to do with the skew towards graduate studies for the data-focused subset.
The correlation between current salary and age is stronger than the correlation between age and expected earnings for the individual’s first data science/engineering job.
Hours dedicated to learning per week doesn’t appear to vary much with gender or continent, though sample size issues exist.
Expected earning varies strongly by continent. Females also appear to have a higher bottom line for expected salary than males. Those who dedicate more than 40 hours a week to learning data science/engineering appear to expect higher salaries as well.
For both subsets of the survey, there is no exceedingly strong relationship. All correlations are below 0.4.
Current salary and expected salary for an individual’s first data-related job have the strongest relationship for both subsets, with correlations of 0.36 and 0.38.
Let’s dig deeper into the two strongest correlations, income against expected salary and student debt owed, using multivariate analysis.
The male/female wage gap is evident through each gender’s presence above the $100k lines. There aren’t enough data points for genderqueer and trans individuals to draw conclusions.
Ethnic minorities appear to be optimistic about the changing diversity landscape via their expected salaries. They have a notable presence above the $100k expected earning line, but not the $100k current salary line.
Current salaries and student debt levels for graduate students are relatively high, as expected. Bachelor’s degrees appear to have the worst student debt/current salary balance.
It appears that the student debt remaining vs. current salary relationship doesn’t differ much across hours dedicated to learning brackets.
It’s interesting that females expect their next salary to be as relatively low as their current one, but ethnic minorities expect a higher salary.
I’m also surprised more individuals with high levels of student debt aren’t dedicating 20+ hours to learning each week. I bet the current jobs of the individuals in lower time brackets are preventing them from increasing their pace.
The affordability of quality education online is a huge reason why I’m in the 40-80 hour bracket for my personalized data science master’s degree.
Males and non-minorities appear most frequently above the $100k lines. The wage gap is evident in current salary for both females and minorities. Though females appear to expect lower salaries than men, minorities are better represented above the $100k expected earning line.
Higher dispersion exists for the majority demographic in both cases. The relationship between expected and current salary is much stronger for the minority demographic.
The majority of individuals who pursued post-secondary education are above the $25k student debt remaining line. Compared to the data science/engineering subset, the lack of correlation between student debt and current salary for the full survey dataset now makes sense. The aforementioned skew towards graduate studies appears to instead be a skew towards post-secondary studies in general, however.
The highest proportion of individuals above the $50k current salary line belongs to the 0-10 hours dedicated to learning bracket. Proportions of individuals above the $25k student debt remaining line are similar across brackets.
Developing data scientists and engineers are slightly different than new coders in general.
The two datasets do share plenty of common trends. Demographics are similar. Most are willing to relocate. Most don’t use podcasts or attend events yet.
Diversity is still an issue in the workplace, as reflected in current and expected salary for females and ethnic minorities. Student debt owed matches well with current salary and higher levels of education. Most people aren’t replacing the traditional college/university route with full-time online education…yet.
The successes of this exploration are largely due to the detailed design of the Free Code Camp survey.
The main struggle I encountered in this exploration was the lack of a main feature of interest, like the diamond dataset’s price variable. It would be awesome if we could survey the same respondents in a decade or so. We could combine career earnings and career satisfaction with the 2016 survey’s results to build a predictive model to estimate career success.
These are the people who are learning data science and engineering. It is clear that free, self-paced learning resources are important.